Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Ray Cluster/AppWrapper creation #650

Closed

Conversation

Bobbins228
Copy link
Contributor

@Bobbins228 Bobbins228 commented Aug 27, 2024

Issue link

Closes: RHOAIENG-10385 and RHOAIENG-8846

What changes have been made

  • Removed base_template.yaml
  • Removed generate_yaml.py in favour of build_ray_cluster.py
  • Added V1RayCluster API
  • Removed unneeded labels
  • Removed head_info
  • Removed machine_types
  • Changed app_wrapper_yaml to resource_yaml and removed app_wrapper_name
  • Made ray_version a variable for potential future automation
  • Set the image config variable to default to quay.io/rhoai/ray:2.23.0-py39-cu121
  • Moved is_openshift_cluster function to cluster.py
  • Changed create_app_wrapper to create_resource
  • Removed enableIngress catch logic
  • Removed app_wrapper_name
  • Removed _components_resources_down in favour of _delete_resources
  • Refactored the get_cluster method to generate a new ClusterConfiguration with just the name and namespace of the cluster and retrieved yaml. Added is_appwrapper bool so that users can get AppWrappers/Ray Clusters
  • Added _retrieved_cluster boolean for get_cluster command to avoid generating a "false" Ray Cluster via ClusterConfiguration
  • Updated Cluster Configuration documentation and added new doc for methods used when interacting with Ray Clusters/AppWrappers.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Checkout this PR's branch
  • Run poetry build - install if needed (pip install poetry)
  • Run pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
  • Restart your notebook kernel

Testing

All ClusterConfiguration parameters must be tested with the new cluster creation method.

Keep a special eye out for the following as they were the most complex to implement:

  • worker_extended_resource_requests
  • head_extended_resource_requests
  • extended_resource_mapping
  • The notebook annotations.

Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.

Test the new and improved get_cluster() function.

NOTE: You can compare the original & retrieved clusters by setting write_to_file=True on ClusterConfiguration and get_cluster()

  • Create a Ray Cluster
  • Restart your notebook kernel
  • Run cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, is_appwrapper=False, write_to_file=True)
  • Run through the basic cluster. methods
  • Run cluster.down() then cluster.up()
  • Ensure the cluster is created successfully
  • Repeat steps for AppWrapper (Remember to enable AppWrappers in the CFO configmap!)

TODO

  • Add ImagePullSecrets # Done
  • Revise V1RayCluster

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@Bobbins228 Bobbins228 added the test-guided-notebooks Run PR check to verify Guided notebooks label Aug 27, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2024
Copy link
Contributor

openshift-ci bot commented Aug 27, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from bobbins228. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sutaakar sutaakar added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Aug 27, 2024
@Bobbins228 Bobbins228 changed the title [WIP] Refactor Ray Cluster/AppWrapper creation Refactor Ray Cluster/AppWrapper creation Aug 27, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2024
@Bobbins228 Bobbins228 added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Aug 27, 2024
Copy link
Contributor

@Fiona-Waters Fiona-Waters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magnum opus for sure @Bobbins228 ! Will continue reviewing but some nitpicks and a question so far.

docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
docs/ray_cluster_interaction.md Outdated Show resolved Hide resolved
src/codeflare_sdk/cluster/cluster.py Show resolved Hide resolved
@Bobbins228 Bobbins228 force-pushed the refactor-ray-cluster branch 2 times, most recently from cfc037f to dab3020 Compare August 27, 2024 15:52
Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give this another pass tomorrow. Good work!

Comment on lines +1 to +190

@imagePullPolicy.setter
def imagePullPolicy(self, imagePullPolicy):
self._imagePullPolicy = imagePullPolicy

@property
def securityContext(self):
return self._securityContext

@securityContext.setter
def securityContext(self, securityContext):
self._securityContext = securityContext

@property
def idleTimeoutSeconds(self):
return self._idleTimeoutSeconds

@idleTimeoutSeconds.setter
def idleTimeoutSeconds(self, idleTimeoutSeconds):
self._idleTimeoutSeconds = idleTimeoutSeconds

@property
def upscalingMode(self):
return self._upscalingMode

@upscalingMode.setter
def upscalingMode(self, upscalingMode):
self._upscalingMode = upscalingMode

@property
def env(self):
return self._env

@env.setter
def env(self, env):
self._env = env

@property
def envFrom(self):
return self._envFrom

@envFrom.setter
def envFrom(self, envFrom):
self._envFrom = envFrom

@property
def volumeMounts(self):
return self._volumeMounts

@volumeMounts.setter
def volumeMounts(self, volumeMounts):
self._volumeMounts = volumeMounts

def to_dict(self):
"""Returns the model properties as a dict"""
result = {}

for attr, _ in six.iteritems(self.openapi_types):
value = getattr(self, attr)
if isinstance(value, list):
result[attr] = list(
map(lambda x: x.to_dict() if hasattr(x, "to_dict") else x, value)
)
elif hasattr(value, "to_dict"):
result[attr] = value.to_dict()
elif isinstance(value, dict):
result[attr] = dict(
map(
lambda item: (item[0], item[1].to_dict())
if hasattr(item[1], "to_dict")
else item,
value.items(),
)
)
else:
result[attr] = value

return result

def to_str(self):
"""Returns the string representation of the model"""
return pprint.pformat(self.to_dict())

def __repr__(self):
"""For `print` and `pprint`"""
return self.to_str()

def __eq__(self, other):
"""Returns true if both objects are equal"""
if not isinstance(other, V1AutoScalerOptions):
return False

return self.to_dict() == other.to_dict()

def __ne__(self, other):
"""Returns true if both objects are not equal"""
if not isinstance(other, V1AutoScalerOptions):
return True

return self.to_dict() != other.to_dict()
Copy link
Collaborator

@KPostOffice KPostOffice Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these auto-generated or have you created them by hand?

Can you give more info about how they were created?

My main concern is how will we keep these up to date as the Ray API Spec evolves. I think this is definitely the correct approach I just want to raise this concern now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to auto generate these but from what I can see in the KubeRay repo there is no valid Open API generator file -> they should look like this for ref: pet-store.yaml

So I based the models files on the way they should be auto generated from what I can see in the Python K8s API and the specs from raycluster_types.go.

I am not a fan of how I generated these files but given the alternative is to re-create the base template within build_ray_cluster.py I am not sure how we should proceed.

write_to_file=write_to_file,
appwrapper=is_appwrapper,
)
cluster = Cluster(cluster_config, is_retrieved_cluster=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we turn this into something like:

cluster = Cluster.cluster_from_k8

I'm not a huge fan of is_retrieved_cluster flag leaking out into public apis. It feels very awkward to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate a bit more on this?

That bool exists to prevent a Ray Cluster from being built in build_ray_cluster.py. Other wise when we create the limited ClusterConfiguration the Cluster object is created but you would get duplicated print statements for "Written to: {output_file_name} from first creating the limited cluster then from creating a file from the retrieved cluster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can create a class function which has different logic which also returns a cluster. Something like:

cluster = Cluster.new_cluster_skip_create(cluster_config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are a hero, that is a great idea!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KPostOffice WDYT of the new changes for get_cluster I have added?

When trying to initialise the Cluster object I was getting exception errors for not providing a ClusterConfiguration. I tried to think of a way around this and I opted to set the CC as none when using get_cluster which would allow me to set the config and resource_yaml after initialisation.

I in turn added a warning if a user tried to specify ClusterConfiguration as None. I feel like this is still a bit crude and would welcome any further suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to spend some time pair programming this tomorrow? We can spitball some ideas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds good I can set up a meeting for later

src/codeflare_sdk/cluster/cluster.py Show resolved Hide resolved
tests/test-case-custom-image.yaml Outdated Show resolved Hide resolved
src/codeflare_sdk/utils/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/utils/build_ray_cluster.py Outdated Show resolved Hide resolved
@Bobbins228 Bobbins228 added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Sep 2, 2024
@Bobbins228 Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2024
@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Bobbins228
Copy link
Contributor Author

Closing in favour of #751

@Bobbins228 Bobbins228 closed this Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. test-guided-notebooks Run PR check to verify Guided notebooks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants